For the take-home part of the MSDS 401 Final Exam, you are tasked with analyzing data on new daily covid-19 cases and deaths in European Union (EU) and European Economic Area (EEA) countries. A data file may be downloaded here, or you may use the provided read.csv() code in the ‘setup’ code chunk below to read the data directly from the web csv. Either approach is acceptable; the data should be the same.
Once you have defined a data frame with the daily case and death and country data, you are asked to: (1) perform an Exploratory Data Analysis (EDA), (2) perform some hypothesis testing, (3) perform some correlation testing, and (4) fit and describe a linear regression model. Each of these four (4) items is further explained below and “code chunks” have been created for you in which to add your R code, just as with the R and Data Analysis Assignments. You may add additional code chunks, as needed. You should make comments in the code chunks or add clarifying text between code chunks that you think further your work.
A data dictionary for the dataset is available here.
“Incidence rate” is equal to new daily cases per 100K individuals. Country population estimates can be found in ‘popData2020.’ You will calculate a daily incidence rate in item (1), for each country, that we will explore further in items (2) and (3).
“Fatality rate” is equal to new daily deaths per 100K individuals. Country population estimates can be found in ‘popData2020.’ You will calculate a daily fatality rate in item (1), for each country, that we will explore further in items (2) and (3).
Perform an Exploratory Data Analysis (EDA). Your EDA is exactly that: yours. Your knit .html should include the visualizations and summary tables that you find valuable while exploring this dataset. However, at minimum, your EDA must include the following:
# add per-100K incidence and fatality rate columns
data$Incidence_rate <- (data$cases / data$popData2020) * 100000
data$Fatality_rate <- (data$deaths / data$popData2020) * 100000
head(data)
## dateRep day month year cases deaths countriesAndTerritories geoId
## 1 2022-10-23 23 10 2022 3557 0 Austria AT
## 2 2022-10-22 22 10 2022 5494 4 Austria AT
## 3 2022-10-21 21 10 2022 7776 4 Austria AT
## 4 2022-10-20 20 10 2022 8221 6 Austria AT
## 5 2022-10-19 19 10 2022 10007 8 Austria AT
## 6 2022-10-18 18 10 2022 13204 7 Austria AT
## countryterritoryCode popData2020 Incidence_rate Fatality_rate
## 1 AUT 8901064 39.96151 0.00000000
## 2 AUT 8901064 61.72296 0.04493845
## 3 AUT 8901064 87.36034 0.04493845
## 4 AUT 8901064 92.35974 0.06740767
## 5 AUT 8901064 112.42476 0.08987690
## 6 AUT 8901064 148.34182 0.07864228
summary(data)
## dateRep day month year
## Min. :2020-01-01 Min. : 1.00 Min. : 1.000 Min. :2020
## 1st Qu.:2020-10-17 1st Qu.: 8.00 1st Qu.: 4.000 1st Qu.:2020
## Median :2021-06-17 Median :16.00 Median : 6.000 Median :2021
## Mean :2021-06-17 Mean :15.68 Mean : 6.431 Mean :2021
## 3rd Qu.:2022-02-13 3rd Qu.:23.00 3rd Qu.: 9.000 3rd Qu.:2022
## Max. :2022-10-26 Max. :31.00 Max. :12.000 Max. :2022
##
## cases deaths countriesAndTerritories geoId
## Min. :-348846 Min. : -217.00 Finland : 1024 FI : 1024
## 1st Qu.: 111 1st Qu.: 0.00 France : 1006 FR : 1006
## Median : 705 Median : 5.00 Czechia : 1003 CZ : 1003
## Mean : 6088 Mean : 40.87 Lithuania: 997 LT : 997
## 3rd Qu.: 3483 3rd Qu.: 31.00 Germany : 992 DE : 992
## Max. : 501635 Max. :13743.00 Sweden : 982 SE : 982
## NA's :93 NA's :292 (Other) :22725 (Other):22725
## countryterritoryCode popData2020 Incidence_rate Fatality_rate
## FIN : 1024 Min. : 38747 Min. :-518.189 Min. : -0.32234
## FRA : 1006 1st Qu.: 2095861 1st Qu.: 2.574 1st Qu.: 0.00000
## CZE : 1003 Median : 6951482 Median : 13.119 Median : 0.08189
## LTU : 997 Mean :15348035 Mean : 42.502 Mean : 0.26944
## DEU : 992 3rd Qu.:11522440 3rd Qu.: 41.035 3rd Qu.: 0.32034
## SWE : 982 Max. :83166711 Max. :3785.420 Max. :128.21679
## (Other):22725 NA's :93 NA's :292
str(data)
## 'data.frame': 28729 obs. of 12 variables:
## $ dateRep : Date, format: "2022-10-23" "2022-10-22" ...
## $ day : int 23 22 21 20 19 18 17 16 15 14 ...
## $ month : int 10 10 10 10 10 10 10 10 10 10 ...
## $ year : int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
## $ cases : int 3557 5494 7776 8221 10007 13204 9964 6606 8818 11751 ...
## $ deaths : int 0 4 4 6 8 7 8 12 6 10 ...
## $ countriesAndTerritories: Factor w/ 30 levels "Austria","Belgium",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ geoId : Factor w/ 30 levels "AT","BE","BG",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ countryterritoryCode : Factor w/ 30 levels "AUT","BEL","BGR",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ popData2020 : int 8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 ...
## $ Incidence_rate : num 40 61.7 87.4 92.4 112.4 ...
## $ Fatality_rate : num 0 0.0449 0.0449 0.0674 0.0899 ...
# drop rows with missing values
data <- na.omit(data)
summary(data)
## dateRep day month year
## Min. :2020-01-01 Min. : 1.00 Min. : 1.000 Min. :2020
## 1st Qu.:2020-10-16 1st Qu.: 8.00 1st Qu.: 4.000 1st Qu.:2020
## Median :2021-06-16 Median :16.00 Median : 6.000 Median :2021
## Mean :2021-06-16 Mean :15.68 Mean : 6.441 Mean :2021
## 3rd Qu.:2022-02-13 3rd Qu.:23.00 3rd Qu.: 9.000 3rd Qu.:2022
## Max. :2022-10-26 Max. :31.00 Max. :12.000 Max. :2022
##
## cases deaths countriesAndTerritories geoId
## Min. :-348846 Min. : -217.00 Finland : 1024 FI : 1024
## 1st Qu.: 119 1st Qu.: 0.00 France : 1006 FR : 1006
## Median : 723 Median : 5.00 Czechia : 1002 CZ : 1002
## Mean : 6149 Mean : 40.91 Lithuania: 997 LT : 997
## 3rd Qu.: 3554 3rd Qu.: 31.00 Germany : 992 DE : 992
## Max. : 501635 Max. :13743.00 Sweden : 982 SE : 982
## (Other) :22341 (Other):22341
## countryterritoryCode popData2020 Incidence_rate Fatality_rate
## FIN : 1024 Min. : 38747 Min. :-518.189 Min. : -0.32234
## FRA : 1006 1st Qu.: 2095861 1st Qu.: 2.577 1st Qu.: 0.00000
## CZE : 1002 Median : 6951482 Median : 13.184 Median : 0.08189
## LTU : 997 Mean :15449398 Mean : 42.152 Mean : 0.27001
## DEU : 992 3rd Qu.:11522440 3rd Qu.: 40.979 3rd Qu.: 0.32111
## SWE : 982 Max. :83166711 Max. :3785.420 Max. :128.21679
## (Other):22341
str(data)
## 'data.frame': 28344 obs. of 12 variables:
## $ dateRep : Date, format: "2022-10-23" "2022-10-22" ...
## $ day : int 23 22 21 20 19 18 17 16 15 14 ...
## $ month : int 10 10 10 10 10 10 10 10 10 10 ...
## $ year : int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
## $ cases : int 3557 5494 7776 8221 10007 13204 9964 6606 8818 11751 ...
## $ deaths : int 0 4 4 6 8 7 8 12 6 10 ...
## $ countriesAndTerritories: Factor w/ 30 levels "Austria","Belgium",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ geoId : Factor w/ 30 levels "AT","BE","BG",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ countryterritoryCode : Factor w/ 30 levels "AUT","BEL","BGR",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ popData2020 : int 8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 ...
## $ Incidence_rate : num 40 61.7 87.4 92.4 112.4 ...
## $ Fatality_rate : num 0 0.0449 0.0449 0.0674 0.0899 ...
## - attr(*, "na.action")= 'omit' Named int [1:385] 1433 1434 1436 1440 1442 1446 1808 1928 1931 1932 ...
## ..- attr(*, "names")= chr [1:385] "1433" "1434" "1436" "1440" ...
We found that some cases and deaths values are negative (most likely retroactive data corrections), so we filter them out. Note that keeping only strictly positive counts also drops days with zero reported cases or deaths.
# keep only strictly positive case and death counts
data <- data %>%
  filter(cases > 0, deaths > 0)
# A visualization exploring new cases, per country, over time.
top_mean_cases <- data %>%
group_by(countriesAndTerritories) %>%
summarise_at(vars(cases), list(name = mean)) %>%
arrange(desc(name)) %>%
slice_head(n = 7) %>%
pull(countriesAndTerritories)
top_mean_cases
## [1] France Germany Italy Spain Greece Netherlands
## [7] Poland
## 30 Levels: Austria Belgium Bulgaria Croatia Cyprus Czechia Denmark ... Sweden
top_countries_cases <- data %>%
filter(countriesAndTerritories %in% top_mean_cases)
head(top_countries_cases)
## dateRep day month year cases deaths countriesAndTerritories geoId
## 1 2022-10-24 24 10 2022 6907 122 France FR
## 2 2022-10-23 23 10 2022 31470 17 France FR
## 3 2022-10-22 22 10 2022 43746 33 France FR
## 4 2022-10-21 21 10 2022 49087 81 France FR
## 5 2022-10-20 20 10 2022 56793 69 France FR
## 6 2022-10-19 19 10 2022 63031 106 France FR
## countryterritoryCode popData2020 Incidence_rate Fatality_rate
## 1 FRA 67320216 10.25992 0.18122342
## 2 FRA 67320216 46.74673 0.02525244
## 3 FRA 67320216 64.98197 0.04901945
## 4 FRA 67320216 72.91569 0.12032047
## 5 FRA 67320216 84.36247 0.10249521
## 6 FRA 67320216 93.62864 0.15745642
p <- ggplot(top_countries_cases, aes(x = round_date(as.POSIXct(dateRep), "days"),
y = cases, color = countriesAndTerritories)) +
geom_line() +
xlab("")+
theme_classic()+
labs(title = "New Cases From 2020 to 2022 ",
x = "Date",
y = "New Cases")+
theme(plot.title = element_text(hjust = 0.5, size = 15))
p
In the new-cases plot, France and the Netherlands have the highest numbers of new cases in 2022, indicating large covid-19 outbreaks in these countries that year.
top_mean_incidence <- data %>%
group_by(countriesAndTerritories) %>%
summarise_at(vars(Incidence_rate), list(name = mean)) %>%
arrange(desc(name)) %>%
slice_head(n = 7) %>%
pull(countriesAndTerritories)
top_mean_incidence
## [1] Iceland Cyprus Greece Liechtenstein Estonia
## [6] Slovenia Denmark
## 30 Levels: Austria Belgium Bulgaria Croatia Cyprus Czechia Denmark ... Sweden
top_countries_incidence <- data %>%
filter(countriesAndTerritories %in% top_mean_incidence)
head(top_countries_incidence)
## dateRep day month year cases deaths countriesAndTerritories geoId
## 1 2022-10-20 20 10 2022 2755 2 Cyprus CY
## 2 2022-10-13 13 10 2022 2759 2 Cyprus CY
## 3 2022-10-06 6 10 2022 2789 5 Cyprus CY
## 4 2022-09-29 29 9 2022 2681 2 Cyprus CY
## 5 2022-09-22 22 9 2022 2932 2 Cyprus CY
## 6 2022-09-15 15 9 2022 2482 5 Cyprus CY
## countryterritoryCode popData2020 Incidence_rate Fatality_rate
## 1 CYP 888005 310.2460 0.2252240
## 2 CYP 888005 310.6964 0.2252240
## 3 CYP 888005 314.0748 0.5630599
## 4 CYP 888005 301.9127 0.2252240
## 5 CYP 888005 330.1783 0.2252240
## 6 CYP 888005 279.5029 0.5630599
p <- ggplot(top_countries_incidence, aes(x = round_date(as.POSIXct(dateRep), "days"),
y = Incidence_rate, color = countriesAndTerritories)) +
geom_line() +
xlab("")+
theme_classic()+
labs(title = "Incidence Rate From 2020 to 2022 ",
x = "Date",
y = "Incidence Rate")+
theme(plot.title = element_text(hjust = 0.5, size = 15))
p
p <- ggplot(top_countries_incidence, aes(x = countriesAndTerritories, y = Incidence_rate))+
geom_boxplot()
p
As we can see, the boxplot contains many extreme outliers. Let's try a log transformation.
p <- ggplot(top_countries_incidence, aes(x = round_date(as.POSIXct(dateRep), "days"),
y = log10(Incidence_rate), color = countriesAndTerritories)) +
geom_line() +
xlab("")+
theme_classic()+
labs(title = "Incidence Rate From 2020 to 2022 ",
x = "Date",
y = "Log Incidence Rate")+
theme(plot.title = element_text(hjust = 0.5, size = 15))
p
p <- ggplot(top_countries_incidence, aes(x = countriesAndTerritories, y = log10(Incidence_rate)))+
geom_boxplot()
p
The log transformation doesn't solve the problem: it compresses the upper tail but creates many extremely small outliers at the lower end instead. Let's inspect the countries individually:
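Before moving on, one common workaround is a shifted log transform, log10(x + 1), which maps zero to zero and keeps very small rates finite instead of pushing them toward large negative values. A sketch, reusing the `top_countries_incidence` data frame from above:

```r
# Shifted log transform: log10(x + 1) compresses the upper tail without
# creating extreme negative values for near-zero incidence rates.
p <- ggplot(top_countries_incidence,
            aes(x = dateRep, y = log10(Incidence_rate + 1),
                color = countriesAndTerritories)) +
  geom_line() +
  theme_classic() +
  labs(title = "Shifted Log Incidence Rate From 2020 to 2022",
       x = "Date", y = "log10(Incidence Rate + 1)") +
  theme(plot.title = element_text(hjust = 0.5, size = 15))
p
```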
top_mean_incidence
## [1] Iceland Cyprus Greece Liechtenstein Estonia
## [6] Slovenia Denmark
## 30 Levels: Austria Belgium Bulgaria Croatia Cyprus Czechia Denmark ... Sweden
Cyprus <- data %>%
filter(countriesAndTerritories == 'Cyprus')
p <- ggplot(Cyprus, aes(x = round_date(as.POSIXct(dateRep), "days"),
y = Incidence_rate)) +
geom_line(color = 'blue') +
xlab("")+
theme_classic()+
labs(title = " Cyprus Incidence Rate From 2020 to 2022 ",
x = "Date",
y = "Incidence Rate")+
theme(plot.title = element_text(hjust = 0.5, size = 15))
p
p <- ggplot(Cyprus, aes(x = countriesAndTerritories, y = Incidence_rate))+
geom_boxplot()
p
Find extreme outliers:
quantile(Cyprus$Incidence_rate, probs = 0.75)
## 75%
## 204.2218
# note: 69.96019 is a hard-coded value; the conventional fence would instead
# use the Q3 printed above (204.2218)
upper <- 69.96019 + 3 * IQR(Cyprus$Incidence_rate)
extreme_outliers <- Cyprus[Cyprus$Incidence_rate > upper, ]
extreme_outliers
## dateRep day month year cases deaths countriesAndTerritories geoId
## 13 2022-07-28 28 7 2022 6863 16 Cyprus CY
## 14 2022-07-21 21 7 2022 10152 13 Cyprus CY
## 15 2022-07-14 14 7 2022 15386 7 Cyprus CY
## 16 2022-07-07 7 7 2022 14914 4 Cyprus CY
## 17 2022-06-30 30 6 2022 10879 3 Cyprus CY
## 18 2022-06-23 23 6 2022 7263 2 Cyprus CY
## 25 2022-05-05 5 5 2022 9559 12 Cyprus CY
## 27 2022-04-21 21 4 2022 6115 18 Cyprus CY
## 45 2022-03-28 28 3 2022 6332 3 Cyprus CY
## countryterritoryCode popData2020 Incidence_rate Fatality_rate
## 13 CYP 888005 772.8560 1.8017917
## 14 CYP 888005 1143.2368 1.4639557
## 15 CYP 888005 1732.6479 0.7882838
## 16 CYP 888005 1679.4950 0.4504479
## 17 CYP 888005 1225.1057 0.3378359
## 18 CYP 888005 817.9008 0.2252240
## 25 CYP 888005 1076.4579 1.3513437
## 27 CYP 888005 688.6222 2.0270156
## 45 CYP 888005 713.0590 0.3378359
summary(extreme_outliers)
## dateRep day month year
## Min. :2022-03-28 Min. : 5.00 Min. :3.000 Min. :2022
## 1st Qu.:2022-05-05 1st Qu.:14.00 1st Qu.:5.000 1st Qu.:2022
## Median :2022-06-30 Median :21.00 Median :6.000 Median :2022
## Mean :2022-06-12 Mean :19.67 Mean :5.778 Mean :2022
## 3rd Qu.:2022-07-14 3rd Qu.:28.00 3rd Qu.:7.000 3rd Qu.:2022
## Max. :2022-07-28 Max. :30.00 Max. :7.000 Max. :2022
##
## cases deaths countriesAndTerritories geoId
## Min. : 6115 Min. : 2.000 Cyprus :9 CY :9
## 1st Qu.: 6863 1st Qu.: 3.000 Austria :0 AT :0
## Median : 9559 Median : 7.000 Belgium :0 BE :0
## Mean : 9718 Mean : 8.667 Bulgaria:0 BG :0
## 3rd Qu.:10879 3rd Qu.:13.000 Croatia :0 CZ :0
## Max. :15386 Max. :18.000 Czechia :0 DE :0
## (Other) :0 (Other):0
## countryterritoryCode popData2020 Incidence_rate Fatality_rate
## CYP :9 Min. :888005 Min. : 688.6 Min. :0.2252
## AUT :0 1st Qu.:888005 1st Qu.: 772.9 1st Qu.:0.3378
## BEL :0 Median :888005 Median :1076.5 Median :0.7883
## BGR :0 Mean :888005 Mean :1094.4 Mean :0.9760
## CZE :0 3rd Qu.:888005 3rd Qu.:1225.1 3rd Qu.:1.4640
## DEU :0 Max. :888005 Max. :1732.6 Max. :2.0270
## (Other):0
table(extreme_outliers$month)
##
## 3 4 5 6 7
## 1 1 1 2 4
table(extreme_outliers$year)
##
## 2022
## 9
The extreme outliers in Cyprus occurred mainly in the spring and summer of 2022 (March through July).
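As a side note, the threshold in the outlier chunk above is partly hard-coded (69.96019). A small helper, sketched here on synthetic data, computes the conventional extreme-outlier fence (Q3 + 3 × IQR) directly from the data:

```r
# Conventional upper fence for extreme outliers: Q3 + 3 * IQR
extreme_upper_fence <- function(x) {
  unname(quantile(x, probs = 0.75) + 3 * IQR(x))
}

# synthetic check: for 1:100, Q3 = 75.25 and IQR = 49.5, so the fence is 223.75
extreme_upper_fence(1:100)
## [1] 223.75
```

Applied to `Cyprus$Incidence_rate`, this would use the printed Q3 (204.2218) in place of the hard-coded value and raise the threshold accordingly.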
# A visualization exploring fatality rate, per country, over time.
top_mean_fatality <- data %>%
group_by(countriesAndTerritories) %>%
summarise_at(vars(Fatality_rate), list(name = mean)) %>%
arrange(desc(name)) %>%
slice_head(n = 7) %>%
pull(countriesAndTerritories)
top_mean_fatality
## [1] Liechtenstein Greece Iceland Hungary Bulgaria
## [6] Slovakia Croatia
## 30 Levels: Austria Belgium Bulgaria Croatia Cyprus Czechia Denmark ... Sweden
top_countries_fatality <- data %>%
  # note: this reuses the top-incidence countries; substitute top_mean_fatality
  # to match the fatality ranking computed above
  filter(countriesAndTerritories %in% top_mean_incidence)
head(top_countries_fatality)
## dateRep day month year cases deaths countriesAndTerritories geoId
## 1 2022-10-20 20 10 2022 2755 2 Cyprus CY
## 2 2022-10-13 13 10 2022 2759 2 Cyprus CY
## 3 2022-10-06 6 10 2022 2789 5 Cyprus CY
## 4 2022-09-29 29 9 2022 2681 2 Cyprus CY
## 5 2022-09-22 22 9 2022 2932 2 Cyprus CY
## 6 2022-09-15 15 9 2022 2482 5 Cyprus CY
## countryterritoryCode popData2020 Incidence_rate Fatality_rate
## 1 CYP 888005 310.2460 0.2252240
## 2 CYP 888005 310.6964 0.2252240
## 3 CYP 888005 314.0748 0.5630599
## 4 CYP 888005 301.9127 0.2252240
## 5 CYP 888005 330.1783 0.2252240
## 6 CYP 888005 279.5029 0.5630599
p <- ggplot(top_countries_fatality, aes(x = round_date(as.POSIXct(dateRep), "days"),
y = Fatality_rate, color = countriesAndTerritories)) +
geom_line() +
xlab("")+
theme_classic()+
labs(title = "Fatality Rate From 2020 to 2022 ",
x = "Date",
y = "Fatality Rate")+
theme(plot.title = element_text(hjust = 0.5, size = 15))
p
Greece has an extremely high fatality rate in 2021.
p <- ggplot(top_countries_fatality, aes(x = countriesAndTerritories, y = Fatality_rate))+
geom_boxplot()
p
p <- ggplot(top_countries_fatality, aes(x = round_date(as.POSIXct(dateRep), "days"),
y = log10(Fatality_rate), color = countriesAndTerritories)) +
geom_line() +
xlab("")+
theme_classic()+
labs(title = "Fatality Rate From 2020 to 2022 ",
x = "Date",
y = "Log Fatality Rate")+
theme(plot.title = element_text(hjust = 0.5, size = 15))
p
p <- ggplot(top_countries_fatality, aes(x = countriesAndTerritories, y = log10(Fatality_rate)))+
geom_boxplot()
p
# A table or visualization exploring some other aspect of the data.
str(data)
## 'data.frame': 20518 obs. of 12 variables:
## $ dateRep : Date, format: "2022-10-22" "2022-10-21" ...
## $ day : int 22 21 20 19 18 17 16 15 14 13 ...
## $ month : int 10 10 10 10 10 10 10 10 10 10 ...
## $ year : int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
## $ cases : int 5494 7776 8221 10007 13204 9964 6606 8818 11751 13068 ...
## $ deaths : int 4 4 6 8 7 8 12 6 10 14 ...
## $ countriesAndTerritories: Factor w/ 30 levels "Austria","Belgium",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ geoId : Factor w/ 30 levels "AT","BE","BG",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ countryterritoryCode : Factor w/ 30 levels "AUT","BEL","BGR",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ popData2020 : int 8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 8901064 ...
## $ Incidence_rate : num 61.7 87.4 92.4 112.4 148.3 ...
## $ Fatality_rate : num 0.0449 0.0449 0.0674 0.0899 0.0786 ...
## - attr(*, "na.action")= 'omit' Named int [1:385] 1433 1434 1436 1440 1442 1446 1808 1928 1931 1932 ...
## ..- attr(*, "names")= chr [1:385] "1433" "1434" "1436" "1440" ...
# A visualization exploring new deaths, per country, over time.
top_mean_deaths <- data %>%
group_by(countriesAndTerritories) %>%
summarise_at(vars(deaths), list(name = mean)) %>%
arrange(desc(name)) %>%
slice_head(n = 7) %>%
pull(countriesAndTerritories)
top_mean_deaths
## [1] Italy France Germany Poland Spain Greece Hungary
## 30 Levels: Austria Belgium Bulgaria Croatia Cyprus Czechia Denmark ... Sweden
top_countries_deaths <- data %>%
filter(countriesAndTerritories %in% top_mean_deaths)
head(top_countries_deaths)
## dateRep day month year cases deaths countriesAndTerritories geoId
## 1 2022-10-24 24 10 2022 6907 122 France FR
## 2 2022-10-23 23 10 2022 31470 17 France FR
## 3 2022-10-22 22 10 2022 43746 33 France FR
## 4 2022-10-21 21 10 2022 49087 81 France FR
## 5 2022-10-20 20 10 2022 56793 69 France FR
## 6 2022-10-19 19 10 2022 63031 106 France FR
## countryterritoryCode popData2020 Incidence_rate Fatality_rate
## 1 FRA 67320216 10.25992 0.18122342
## 2 FRA 67320216 46.74673 0.02525244
## 3 FRA 67320216 64.98197 0.04901945
## 4 FRA 67320216 72.91569 0.12032047
## 5 FRA 67320216 84.36247 0.10249521
## 6 FRA 67320216 93.62864 0.15745642
p <- ggplot(top_countries_deaths, aes(x = round_date(as.POSIXct(dateRep), "days"),
y = deaths, color = countriesAndTerritories)) +
geom_line() +
xlab("")+
theme_classic()+
labs(title = "Deaths From 2020 to 2022 ",
x = "Date",
y = "Deaths")+
theme(plot.title = element_text(hjust = 0.5, size = 15))
p
#explore case fatality rates per country
case_fatality_rate_summary <- data %>%
group_by(countriesAndTerritories) %>%
summarise(case_fatality_rate = sum(deaths)/sum(cases)) %>%
arrange(desc(case_fatality_rate))
case_fatality_rate_summary
## # A tibble: 30 × 2
## countriesAndTerritories case_fatality_rate
## <fct> <dbl>
## 1 Liechtenstein 0.0425
## 2 Bulgaria 0.0300
## 3 Hungary 0.0224
## 4 Romania 0.0207
## 5 Poland 0.0192
## 6 Croatia 0.0138
## 7 Ireland 0.0132
## 8 Malta 0.0105
## 9 Czechia 0.0101
## 10 Slovakia 0.00828
## # ℹ 20 more rows
#A visualization exploring case fatality rates per country
ggplot(case_fatality_rate_summary, aes(x = reorder(countriesAndTerritories, -case_fatality_rate), y = case_fatality_rate, fill = countriesAndTerritories)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Case Fatality Rates by Country",
x = "Country",
y = "Case Fatality Rate ") +
theme_minimal() +
scale_fill_viridis_d()
Select two (2) countries of your choosing and compare their incidence or fatality rates using hypothesis testing. At minimum, your work should include the following:
Incidence Rate
# Incidence Rate
# Hypothesis Test for Difference in Means
selected_countries <- c("Spain", "Italy")
filtered_data <- data[data$countriesAndTerritories %in% selected_countries,]
ggplot(filtered_data, aes(x = Incidence_rate, fill = countriesAndTerritories)) +
geom_histogram(position = "dodge", binwidth = 20, alpha = 0.7) +
labs(title = "Incidence Rate Distributions with Confidence Intervals",
x = "Incidence rate",
y = "Frequency") +
facet_wrap(~ countriesAndTerritories) +
theme_minimal() +
theme(legend.position = "none")
Two-sided t-test Null Hypothesis (H0): There is no difference in the mean incidence rates between Spain and Italy.
Justification: We selected a two-sided t-test for comparing the mean incidence rates of Spain and Italy. This test is appropriate to compare the means of two independent samples.
Assumption Check: The histograms above show that the incidence-rate distributions in Spain and Italy are right-skewed rather than normal. However, both samples are large (n > 30), so by the Central Limit Theorem the t-test is still reasonably robust for comparing means.
Selected Alpha: Our selected alpha level is 0.05, which means that we are willing to accept a 5% chance of committing a Type I error when rejecting the null hypothesis.
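As an additional visual check of the normality assumption (a sketch reusing `filtered_data` from the chunk above), Q-Q plots make departures from normality easier to judge than histograms:

```r
# Q-Q plots of daily incidence rates; systematic curvature away from the
# reference line indicates non-normality (here, right skew).
spain_ir <- filtered_data$Incidence_rate[filtered_data$countriesAndTerritories == "Spain"]
italy_ir <- filtered_data$Incidence_rate[filtered_data$countriesAndTerritories == "Italy"]
par(mfrow = c(1, 2))
qqnorm(spain_ir, main = "Spain"); qqline(spain_ir)
qqnorm(italy_ir, main = "Italy"); qqline(italy_ir)
par(mfrow = c(1, 1))
```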
spain_data <- filtered_data[filtered_data$countriesAndTerritories == 'Spain',]
italy_data <- filtered_data[filtered_data$countriesAndTerritories == 'Italy',]
t.test(spain_data$Incidence_rate, italy_data$Incidence_rate, alternative = c("two.sided"), mu = 0, paired = FALSE, var.equal = TRUE, conf.level = 0.95)
##
## Two Sample t-test
##
## data: spain_data$Incidence_rate and italy_data$Incidence_rate
## t = -1.9531, df = 1789, p-value = 0.05097
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -10.64326657 0.02232826
## sample estimates:
## mean of x mean of y
## 34.98388 40.29435
conf_interval_spain <- round(t.test(spain_data$Incidence_rate)$conf.int, 2)
conf_interval_italy <- round(t.test(italy_data$Incidence_rate)$conf.int, 2)
print(paste("Confidence Interval for Spain: (", conf_interval_spain[1], ",", conf_interval_spain[2], ")."))
## [1] "Confidence Interval for Spain: ( 31.23 , 38.74 )."
print(paste("Confidence Interval for Italy: (", conf_interval_italy[1], ",", conf_interval_italy[2], ")."))
## [1] "Confidence Interval for Italy: ( 36.55 , 44.04 )."
Conclusion of the t-test: The p-value (0.05097) is slightly larger than 0.05, so we fail to reject the null hypothesis that there is no difference in the mean incidence rates between Spain and Italy.
combined_data_incidence <- rbind(
data.frame(Incidence_rate = spain_data$Incidence_rate, Country = 'Spain'),
data.frame(Incidence_rate = italy_data$Incidence_rate, Country = 'Italy')
)
ggplot(combined_data_incidence, aes(x = Incidence_rate, fill = Country)) +
geom_histogram(position = "dodge", binwidth = 20, alpha = 0.7) +
geom_vline(data = data.frame(Country = 'Spain', xintercept = conf_interval_spain), aes(xintercept = xintercept, color = Country), linetype = "dashed") +
geom_vline(data = data.frame(Country = 'Italy', xintercept = conf_interval_italy), aes(xintercept = xintercept, color = Country), linetype = "dashed") +
scale_color_manual(values = c('Spain' = 'darkorange', 'Italy' = 'darkorange')) +
labs(title = "Incidence Rate Distributions with Confidence Intervals",
x = "Incidence rate",
y = "Frequency") +
facet_wrap(~ Country) +
theme_minimal() +
theme(legend.position = "none")
# Hypothesis Test for Variances
selected_countries <- c("Spain", "Italy")
filtered_data <- data[data$countriesAndTerritories %in% selected_countries,]
ggplot(filtered_data, aes(x = countriesAndTerritories, y = Incidence_rate, fill = countriesAndTerritories)) +
geom_boxplot() +
labs(title = "Incidence Rate: Spain vs. Italy",
x = "Country",
y = "Incidence Rate") +
theme_minimal()
ANOVA Test The null hypothesis (H0): There are no significant differences in the mean incidence rates between Spain and Italy.
Justification: We use a one-way Analysis of Variance (ANOVA) to assess whether mean incidence rates differ between Spain and Italy. ANOVA compares means across groups; with exactly two groups it reduces to a pooled two-sample t-test.
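With exactly two groups, one-way ANOVA and the pooled two-sample t-test are the same test: the ANOVA F statistic equals the square of the t statistic. A quick sketch on toy data (illustrative values only, not the covid data):

```r
set.seed(1)  # toy data for illustration
toy <- data.frame(rate = c(rnorm(30, mean = 10), rnorm(30, mean = 12)),
                  group = rep(c("A", "B"), each = 30))
t_stat <- t.test(rate ~ group, data = toy, var.equal = TRUE)$statistic
f_stat <- summary(aov(rate ~ group, data = toy))[[1]][["F value"]][1]
all.equal(unname(t_stat)^2, f_stat)
## [1] TRUE
```

This is why the ANOVA p-value below matches the earlier pooled t-test p-value so closely.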
two_countries_data <- data %>%
filter(countriesAndTerritories %in% c("Spain", "Italy"))
bartlett.test(Incidence_rate ~ countriesAndTerritories, two_countries_data)
##
## Bartlett test of homogeneity of variances
##
## data: Incidence_rate by countriesAndTerritories
## Bartlett's K-squared = 6.1314, df = 1, p-value = 0.01328
Assumption Check: The box plots above suggest broadly similar spreads, but the Bartlett test p-value (0.01328) is below 0.05, so we reject the null hypothesis of equal variances: there is evidence that the variances in the two countries differ. Violations of normality and homogeneity of variances can reduce the accuracy of the results, although ANOVA is fairly robust with samples as large as these.
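Because the Bartlett test rejected equal variances, a Welch t-test, which does not assume homogeneity of variances, makes a useful robustness check (a sketch reusing `two_countries_data` from the chunk above):

```r
# Welch's t-test: like the pooled t-test, but with the Satterthwaite
# degrees-of-freedom correction for unequal variances (t.test's default).
t.test(Incidence_rate ~ countriesAndTerritories,
       data = two_countries_data, var.equal = FALSE)
```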
Selected Alpha: Our selected alpha level is 0.05.
anova_result <- aov(Incidence_rate ~ countriesAndTerritories, two_countries_data)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## countriesAndTerritories 1 12535 12535 3.815 0.051 .
## Residuals 1789 5878832 3286
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion of the Statistical Test: The p-value (0.051) is slightly larger than 0.05, so we fail to reject the null hypothesis that there is no difference in the mean incidence rates between Spain and Italy.
Fatality Rate
selected_countries <- c("Spain", "Italy")
filtered_data <- data[data$countriesAndTerritories %in% selected_countries,]
ggplot(filtered_data, aes(x = Fatality_rate, fill = countriesAndTerritories)) +
geom_histogram(position = "dodge", binwidth = 0.1, alpha = 0.7) +
labs(title = "Fatality Rate Distributions with Confidence Intervals",
x = "Fatality rate",
y = "Frequency") +
facet_wrap(~ countriesAndTerritories) +
theme_minimal() +
theme(legend.position = "none")
Two-sided t-test Null Hypothesis (H0): There is no difference in the mean fatality rates between Spain and Italy.
Justification: We selected a two-sided t-test for comparing the mean fatality rates of Spain and Italy. This test is appropriate to compare the means of two independent samples.
Assumption Check: The histograms above show that the fatality-rate distributions in Spain and Italy are right-skewed rather than normal. However, both samples are large (n > 30), so the t-test remains reasonably robust for comparing means.
Selected Alpha: Our selected alpha level is 0.05, which means that we are willing to accept a 5% chance of committing a Type I error when rejecting the null hypothesis.
T-test
spain_data <- filtered_data[filtered_data$countriesAndTerritories == 'Spain',]
italy_data <- filtered_data[filtered_data$countriesAndTerritories == 'Italy',]
t.test(spain_data$Fatality_rate, italy_data$Fatality_rate, alternative = c("two.sided"), mu = 0, paired = FALSE,
var.equal = TRUE, conf.level = 0.95)
##
## Two Sample t-test
##
## data: spain_data$Fatality_rate and italy_data$Fatality_rate
## t = -1.2741, df = 1789, p-value = 0.2028
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.05104086 0.01084147
## sample estimates:
## mean of x mean of y
## 0.2880119 0.3081116
conf_interval_spain_fa <- round(t.test(spain_data$Fatality_rate)$conf.int, 2)
conf_interval_italy_fa <- round(t.test(italy_data$Fatality_rate)$conf.int, 2)
print(paste("Confidence Interval for Spain: (", conf_interval_spain_fa[1], ",", conf_interval_spain_fa[2], ")."))
## [1] "Confidence Interval for Spain: ( 0.27 , 0.31 )."
print(paste("Confidence Interval for Italy: (", conf_interval_italy_fa[1], ",", conf_interval_italy_fa[2], ")."))
## [1] "Confidence Interval for Italy: ( 0.29 , 0.33 )."
Conclusion of the t-test: Given the p-value (0.2028) is greater than the conventional alpha level of 0.05, we fail to reject the null hypothesis. This implies there is no statistically significant difference in the mean daily fatality rates between Spain and Italy, based on the provided data.
# Helper: normal-approximation (z-based) confidence interval; with samples
# this large it closely matches the t-based interval returned by t.test()
calculate_confidence_interval <- function(data, confidence_level = 0.95) {
n <- length(data)
mean_value <- mean(data)
sd_value <- sd(data)
alpha <- 1 - confidence_level
critical_value <- qnorm(1 - alpha / 2)
margin_of_error <- critical_value * (sd_value / sqrt(n))
return(c(mean_value - margin_of_error, mean_value + margin_of_error))
}
# Create a combined dataframe for plotting
combined_data <- rbind(
data.frame(Fatality_rate = spain_data$Fatality_rate, Country = 'Spain'),
data.frame(Fatality_rate = italy_data$Fatality_rate, Country = 'Italy')
)
# Create histograms with confidence intervals
ggplot(combined_data, aes(x = Fatality_rate, fill = Country)) +
geom_histogram(position = "dodge", binwidth = 0.1, alpha = 0.7) +
geom_vline(data = data.frame(Country = 'Spain', xintercept = conf_interval_spain_fa), aes(xintercept = xintercept, color = Country), linetype = "dashed") +
geom_vline(data = data.frame(Country = 'Italy', xintercept = conf_interval_italy_fa), aes(xintercept = xintercept, color = Country), linetype = "dashed") +
scale_color_manual(values = c('Spain' = 'orange', 'Italy' = 'orange')) +
labs(title = "Fatality Rate Distributions with Confidence Intervals",
x = "Fatality rate",
y = "Frequency") +
facet_wrap(~ Country) +
theme_minimal() +
theme(legend.position = "none")
ANOVA Test The null hypothesis (H0): There are no significant differences in the mean fatality rates between Spain and Italy.
Justification: We selected the Analysis of Variance (ANOVA) test to assess whether there are differences in the mean fatality rates between Spain and Italy because ANOVA test is appropriate for comparing means across multiple groups, in this case, the two countries Spain and Italy.
# Box Plot for variance comparison
ggplot(filtered_data, aes(x = countriesAndTerritories, y = Fatality_rate, fill = countriesAndTerritories)) +
geom_boxplot() +
labs(title = "Variance of Fatality Rate: Spain vs Italy",
x = "Country",
y = "Fatality Rate") +
theme_minimal()
two_countries_data <- data %>%
filter(countriesAndTerritories %in% c("Spain", "Italy"))
bartlett.test(Fatality_rate ~ countriesAndTerritories, two_countries_data)
##
## Bartlett test of homogeneity of variances
##
## data: Fatality_rate by countriesAndTerritories
## Bartlett's K-squared = 0.19495, df = 1, p-value = 0.6588
Assumption Check: The box plots above show similar spreads, supporting the homogeneity-of-variance assumption. Because the Bartlett test p-value (0.6588) is greater than 0.05, we do not reject the null hypothesis of equal variances, which supports the use of ANOVA here. Violations of normality can still affect the accuracy of the results, but ANOVA remains robust with sample sizes this large in both countries.
Selected Alpha: Our selected alpha level is 0.05.
# ANOVA test
anova_result <- aov(Fatality_rate ~ countriesAndTerritories, data = filtered_data)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## countriesAndTerritories 1 0.18 0.1796 1.623 0.203
## Residuals 1789 197.90 0.1106
Conclusion of the Statistical Test: The p-value (0.203) is greater than the standard alpha level (0.05), so we fail to reject the null hypothesis. This suggests there is no statistically significant difference in mean fatality rates between the two countries.
Considering all countries, explore the relationship between incidence rates and fatality rates. At minimum, your work should include the following:
# Histogram for Daily Incidence Rates
ggplot(data, aes(x = Incidence_rate)) +
  geom_histogram(bins = 35, fill = "lightblue", alpha = 0.8) +  # alpha must lie in [0, 1]
  labs(title = "Distribution of Daily Incidence Rates", x = "Incidence Rate", y = "Frequency")
# Histogram for Daily Fatality Rates
ggplot(data, aes(x = Fatality_rate)) +
  geom_histogram(bins = 30, fill = "purple", alpha = 0.8) +  # alpha must lie in [0, 1]
  labs(title = "Distribution of Daily Fatality Rates", x = "Fatality Rate", y = "Frequency")
data <- na.omit(data)
pearson_corr <- cor(data$Incidence_rate, data$Fatality_rate, method = "pearson")
print(paste("Pearson Correlation Coefficient:", pearson_corr))
## [1] "Pearson Correlation Coefficient: 0.0950094002591344"
spearman_corr <- cor(data$Incidence_rate, data$Fatality_rate, method = "spearman")
print(paste("Spearman Correlation Coefficient:", spearman_corr))
## [1] "Spearman Correlation Coefficient: 0.56195982704042"
Answer: Neither the daily incidence rates nor the daily fatality rates appear normally distributed; both distributions are right-skewed, with most observations at low rates and a long tail of very high values. Under these conditions the Spearman (rank) correlation is the more appropriate measure: it is robust to skew and outliers, and it indicates a moderate positive monotonic relationship (0.562) between incidence and fatality rates that the Pearson coefficient (0.095) understates.
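The coefficients above can also be accompanied by a significance test via `cor.test()`. A hedged sketch on synthetic right-skewed data; on the exam data the analogous call would be `cor.test(data$Incidence_rate, data$Fatality_rate, method = "spearman")`.

```r
# Hedged sketch: testing H0: rho = 0 for a Spearman correlation, shown on
# synthetic, right-skewed data resembling the rate distributions above.
set.seed(7)
inc <- rexp(200)                     # right-skewed "incidence rates"
fat <- 0.1 * inc + 0.05 * rexp(200)  # related, also right-skewed
ct <- cor.test(inc, fat, method = "spearman")
ct$estimate  # Spearman's rho
ct$p.value   # small values indicate a significant monotonic association
```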
Here, we will fit a model on data from twenty (20) countries considering total new cases as a function of population, population density and gross domestic product (GDP) per capita. Note that the GDP per capita is given in “purchasing power standard,” which considers the costs of goods and services in a country relative to incomes in that country; i.e. we will consider this as appropriately standardized.
Code is given below defining a new data frame, 'model_df,' which provides the total area and standardized GDP per capita for the twenty (20) countries for our model fit. You are responsible for creating a vector of the total new cases across the time frame of the dataset, for each of those countries, and adding that vector to our 'model_df' data frame.
# The code below creates a new data frame, 'model_df,' that includes the area,
# GDP per capita, population and population density for the twenty (20)
# countries of interest. All you should need to do is execute this code, as is.
# You do not need to add code in this chunk. You will need to add code in the
# 'regression_b,' 'regression_c' and 'regression_d' code chunks.
twenty_countries <- c("Austria", "Belgium", "Bulgaria", "Cyprus", "Denmark",
                      "Finland", "France", "Germany", "Hungary", "Ireland",
                      "Latvia", "Lithuania", "Malta", "Norway", "Poland",
                      "Portugal", "Romania", "Slovakia", "Spain", "Sweden")
sq_km <- c(83858, 30510, 110994, 9251, 44493, 338145, 551695, 357386, 93030,
           70273, 64589, 65300, 316, 385178, 312685, 88416, 238397, 49036,
           498511, 450295)
gdp_pps <- c(128, 118, 51, 91, 129, 111, 104, 123, 71, 190, 69, 81, 100, 142,
             71, 78, 65, 71, 91, 120)
model_df <- data %>%
  select(c(countriesAndTerritories, popData2020)) %>%
  filter(countriesAndTerritories %in% twenty_countries) %>%
  distinct(countriesAndTerritories, .keep_all = TRUE) %>%
  add_column(sq_km, gdp_pps) %>%
  mutate(pop_dens = popData2020 / sq_km) %>%
  rename(country = countriesAndTerritories, pop = popData2020)
Next, we need to add one (1) more column to our ‘model_df’ data frame. Specifically, one that has the total number of new cases for each of the twenty (20) countries. We calculate the total number of new cases by summing all the daily new cases, for each country, across all the days in the dataset.
total_cases <- data %>%
  select(c(countriesAndTerritories, cases)) %>%
  group_by(countriesAndTerritories) %>%
  dplyr::summarize(total_cases = sum(cases, na.rm = TRUE)) %>%
  filter(countriesAndTerritories %in% twenty_countries) %>%
  select(total_cases)
model_df <- model_df %>%
  add_column(total_cases)
model_df
## country pop sq_km gdp_pps pop_dens total_cases
## 1 Austria 8901064 83858 128 106.14448 5387997
## 2 Belgium 11522440 30510 118 377.66109 4597870
## 3 Bulgaria 6951482 110994 51 62.62935 1260132
## 4 Cyprus 888005 9251 91 95.99016 519761
## 5 Denmark 5822763 44493 129 130.86919 3163687
## 6 Finland 5525292 338145 111 16.34001 1305112
## 7 France 67320216 551695 104 122.02434 36942541
## 8 Germany 83166711 357386 123 232.70836 35284167
## 9 Hungary 9769526 93030 71 105.01479 2138099
## 10 Ireland 4964440 70273 190 70.64506 607291
## 11 Latvia 1907675 64589 69 29.53560 902622
## 12 Lithuania 2794090 65300 81 42.78851 1241739
## 13 Malta 514564 316 100 1628.36709 76852
## 14 Norway 5367580 385178 142 13.93532 549922
## 15 Poland 37958138 312685 71 121.39418 6157508
## 16 Portugal 10295909 88416 78 116.44848 5505030
## 17 Romania 19328838 238397 65 81.07836 3245139
## 18 Slovakia 5457873 49036 71 111.30339 2475792
## 19 Spain 47332614 498511 91 94.94798 13561646
## 20 Sweden 10327589 450295 120 22.93516 2585009
Now, we will fit our model using the data in ‘model_df.’ We are interested in explaining total cases (response) as a function of population (explanatory), population density (explanatory), and GDP (explanatory).
At minimum, your modeling work should include the following:
# Description of model_df
str(model_df)
## 'data.frame': 20 obs. of 6 variables:
## $ country : Factor w/ 30 levels "Austria","Belgium",..: 1 2 3 5 7 9 10 11 13 15 ...
## $ pop : int 8901064 11522440 6951482 888005 5822763 5525292 67320216 83166711 9769526 4964440 ...
## $ sq_km : num 83858 30510 110994 9251 44493 ...
## $ gdp_pps : num 128 118 51 91 129 111 104 123 71 190 ...
## $ pop_dens : num 106.1 377.7 62.6 96 130.9 ...
## $ total_cases: int 5387997 4597870 1260132 519761 3163687 1305112 36942541 35284167 2138099 607291 ...
## - attr(*, "na.action")= 'omit' Named int [1:385] 1433 1434 1436 1440 1442 1446 1808 1928 1931 1932 ...
## ..- attr(*, "names")= chr [1:385] "1433" "1434" "1436" "1440" ...
summary(model_df)
## country pop sq_km gdp_pps
## Austria : 1 Min. : 514564 Min. : 316 Min. : 51.0
## Belgium : 1 1st Qu.: 5266795 1st Qu.: 60701 1st Qu.: 71.0
## Bulgaria: 1 Median : 7926273 Median : 90723 Median : 95.5
## Cyprus : 1 Mean :17305840 Mean :192118 Mean :100.2
## Denmark : 1 3rd Qu.:13474040 3rd Qu.:342955 3rd Qu.:120.8
## Finland : 1 Max. :83166711 Max. :551695 Max. :190.0
## (Other) :14
## pop_dens total_cases
## Min. : 13.94 Min. : 76852
## 1st Qu.: 57.67 1st Qu.: 1156960
## Median : 100.50 Median : 2530400
## Mean : 179.14 Mean : 6375396
## 3rd Qu.: 121.55 3rd Qu.: 5417255
## Max. :1628.37 Max. :36942541
##
Answer: country is a categorical variable represented as a factor; pop is each country's population count, an integer; sq_km is the land area of each country in square kilometers, numeric; gdp_pps is GDP per capita expressed in purchasing power standards, which adjusts for differing price and income levels across countries; pop_dens is the population density, a numeric value computed as population divided by land area in square kilometers; total_cases is the cumulative number of new daily COVID-19 cases reported in each country, an integer.
The dataset's structure shows 20 records and 6 features. Each row summarizes one country, and the columns give the country's name, population, area in square kilometers, GDP per capita adjusted for purchasing power, population density, and total number of COVID-19 cases.
model_df
## country pop sq_km gdp_pps pop_dens total_cases
## 1 Austria 8901064 83858 128 106.14448 5387997
## 2 Belgium 11522440 30510 118 377.66109 4597870
## 3 Bulgaria 6951482 110994 51 62.62935 1260132
## 4 Cyprus 888005 9251 91 95.99016 519761
## 5 Denmark 5822763 44493 129 130.86919 3163687
## 6 Finland 5525292 338145 111 16.34001 1305112
## 7 France 67320216 551695 104 122.02434 36942541
## 8 Germany 83166711 357386 123 232.70836 35284167
## 9 Hungary 9769526 93030 71 105.01479 2138099
## 10 Ireland 4964440 70273 190 70.64506 607291
## 11 Latvia 1907675 64589 69 29.53560 902622
## 12 Lithuania 2794090 65300 81 42.78851 1241739
## 13 Malta 514564 316 100 1628.36709 76852
## 14 Norway 5367580 385178 142 13.93532 549922
## 15 Poland 37958138 312685 71 121.39418 6157508
## 16 Portugal 10295909 88416 78 116.44848 5505030
## 17 Romania 19328838 238397 65 81.07836 3245139
## 18 Slovakia 5457873 49036 71 111.30339 2475792
## 19 Spain 47332614 498511 91 94.94798 13561646
## 20 Sweden 10327589 450295 120 22.93516 2585009
model_df[c(2,4:6)]
## pop gdp_pps pop_dens total_cases
## 1 8901064 128 106.14448 5387997
## 2 11522440 118 377.66109 4597870
## 3 6951482 51 62.62935 1260132
## 4 888005 91 95.99016 519761
## 5 5822763 129 130.86919 3163687
## 6 5525292 111 16.34001 1305112
## 7 67320216 104 122.02434 36942541
## 8 83166711 123 232.70836 35284167
## 9 9769526 71 105.01479 2138099
## 10 4964440 190 70.64506 607291
## 11 1907675 69 29.53560 902622
## 12 2794090 81 42.78851 1241739
## 13 514564 100 1628.36709 76852
## 14 5367580 142 13.93532 549922
## 15 37958138 71 121.39418 6157508
## 16 10295909 78 116.44848 5505030
## 17 19328838 65 81.07836 3245139
## 18 5457873 71 111.30339 2475792
## 19 47332614 91 94.94798 13561646
## 20 10327589 120 22.93516 2585009
library(kableExtra)
library(tidyr)
correl <- cor(model_df[c(2,4:6)])
correl %>%
kbl() %>%
kable_styling()
| | pop | gdp_pps | pop_dens | total_cases |
|---|---|---|---|---|
| pop | 1.0000000 | 0.0235758 | -0.0813254 | 0.9428576 |
| gdp_pps | 0.0235758 | 1.0000000 | 0.0205849 | 0.0918025 |
| pop_dens | -0.0813254 | 0.0205849 | 1.0000000 | -0.0493727 |
| total_cases | 0.9428576 | 0.0918025 | -0.0493727 | 1.0000000 |
pairs(model_df[c(2,4:6)], col = "blue", main = "Correlation Matrix")
Correlation: There is a strong positive correlation between 'pop' and 'total_cases' (0.943), suggesting that countries with larger populations tend to report more COVID-19 cases. The remaining pairwise correlations are near zero, indicating little or no linear relationship among the other variables.
model1 <- lm(total_cases ~ pop+gdp_pps+pop_dens, model_df)
summary(model1)
##
## Call:
## lm(formula = total_cases ~ pop + gdp_pps + pop_dens, data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8441289 -1186145 55312 1722407 8943757
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.444e+06 2.828e+06 -1.218 0.241
## pop 4.316e-01 3.729e-02 11.574 3.47e-09 ***
## gdp_pps 2.206e+04 2.596e+04 0.850 0.408
## pop_dens 7.848e+02 2.467e+03 0.318 0.755
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3760000 on 16 degrees of freedom
## Multiple R-squared: 0.8945, Adjusted R-squared: 0.8747
## F-statistic: 45.22 on 3 and 16 DF, p-value: 4.881e-08
The full model has an adjusted R² of 0.8747. To obtain a more parsimonious model, we drop some of the variables; the method we chose is stepwise regression by AIC, using stepAIC() from the MASS package.
library(MASS)
stepAIC(model1)
## Start: AIC=609.13
## total_cases ~ pop + gdp_pps + pop_dens
##
## Df Sum of Sq RSS AIC
## - pop_dens 1 1.4304e+12 2.2760e+14 607.26
## - gdp_pps 1 1.0204e+13 2.3638e+14 608.01
## <none> 2.2617e+14 609.13
## - pop 1 1.8938e+15 2.1199e+15 651.89
##
## Step: AIC=607.26
## total_cases ~ pop + gdp_pps
##
## Df Sum of Sq RSS AIC
## - gdp_pps 1 1.0382e+13 2.3799e+14 606.15
## <none> 2.2760e+14 607.26
## - pop 1 1.8980e+15 2.1256e+15 649.94
##
## Step: AIC=606.15
## total_cases ~ pop
##
## Df Sum of Sq RSS AIC
## <none> 2.3799e+14 606.15
## - pop 1 1.9057e+15 2.1436e+15 648.11
##
## Call:
## lm(formula = total_cases ~ pop, data = model_df)
##
## Coefficients:
## (Intercept) pop
## -1.089e+06 4.313e-01
Based on the stepwise results, the model containing only the population variable achieves the lowest AIC (606.15), indicating the most parsimonious fit. We therefore fit a model using population alone.
model2 <- lm(total_cases ~ pop, model_df)
summary(model2)
##
## Call:
## lm(formula = total_cases ~ pop, data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9126129 -702074 608600 1214732 8993753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.089e+06 1.024e+06 -1.064 0.301
## pop 4.313e-01 3.593e-02 12.006 5.01e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3636000 on 18 degrees of freedom
## Multiple R-squared: 0.889, Adjusted R-squared: 0.8828
## F-statistic: 144.1 on 1 and 18 DF, p-value: 5.009e-10
Additionally, we fit Model 3 with population and GDP per capita, the second-best model in the stepwise trace (AIC = 607.26).
model3 <- lm(total_cases ~ pop+gdp_pps, model_df)
summary(model3)
##
## Call:
## lm(formula = total_cases ~ pop + gdp_pps, data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8461112 -1323197 377232 1619448 8946779
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.305e+06 2.719e+06 -1.216 0.241
## pop 4.306e-01 3.616e-02 11.906 1.13e-09 ***
## gdp_pps 2.224e+04 2.526e+04 0.881 0.391
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3659000 on 17 degrees of freedom
## Multiple R-squared: 0.8938, Adjusted R-squared: 0.8813
## F-statistic: 71.56 on 2 and 17 DF, p-value: 5.263e-09
MSE comparison
mean(model2$residuals^2)
## [1] 1.189929e+13
mean(model3$residuals^2)
## [1] 1.138018e+13
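A caveat on the MSE comparison above: in-sample MSE can never increase when a predictor is added, so it mechanically favors Model 3. A penalized criterion such as AIC is a useful complement; on the exam data the call would be `AIC(model2, model3)`. A hedged sketch on synthetic data:

```r
# Hedged sketch on synthetic data: AIC penalizes the extra, irrelevant
# predictor x2, whereas in-sample MSE always drops when a term is added.
set.seed(9)
d <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
d$y <- 2 * d$x1 + rnorm(40)
m_small <- lm(y ~ x1, data = d)
m_big   <- lm(y ~ x1 + x2, data = d)
mean(m_big$residuals^2) <= mean(m_small$residuals^2)  # TRUE by construction
AIC(m_small, m_big)  # the smaller AIC marks the preferred model
```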
par(mfrow=c(1,2))
hist(model2$residuals, main = "Histogram of Residuals", col = "red", freq = FALSE,ylab="Frequency")
#curve(dnorm(x,0,sd(x)),add=TRUE, col="green", lwd = 2)
qqnorm(model2$residuals, main = "Q-Q Plot of Residuals", col = "red", pch = 16)
qqline(model2$residuals, col = "green", lty = 2, lwd = 2)
par(mfrow=c(1,2))
hist(model3$residuals, main = "Histogram of Residuals", col = "red", freq = FALSE,ylab="Frequency")
#curve(dnorm(x,0,sd(x)),add=TRUE, col="green", lwd = 2)
qqnorm(model3$residuals, main = "Q-Q Plot of Residuals", col = "red", pch = 16)
qqline(model3$residuals, col = "green", lty = 2, lwd = 2)
Model 2, recommended by the stepwise method, uses population alone, and its adjusted R² (0.8828) is slightly higher than the initial full model's (0.8747). Although the stepwise method recommends population alone, we also fit Model 3 with population and GDP per capita as predictors; it has a similar p-value and adjusted R² (0.8813) and a slightly lower in-sample MSE. Judging GDP per capita to be a useful predictor of total cases despite its non-significant coefficient (p = 0.391), we select Model 3, with population and GDP, as our preferred model.
p1 <- ggplot(model_df, aes(x = pop, y = total_cases)) +
  geom_point(color = "blue") +
  ggtitle("Population vs Total Cases") +
  theme_minimal()
p2 <- ggplot(model_df, aes(x = gdp_pps, y = total_cases)) +
  geom_point(color = "purple") +
  ggtitle("GDP vs Total Cases") +
  theme_minimal()
grid.arrange(p1, p2, ncol = 2)
qq <- ggplot(model_df, aes(sample = total_cases)) +
  geom_qq(color = "orange") +
  geom_qq_line() +
  labs(title = "Normality for response", x = "Theoretical Quantiles", y = "Sample Quantiles")
qq
Although a normality test on the response variable is optional and adds little to the analysis, checking the normality of the residuals does matter. We review the residuals closely to identify whether any predictors create issues and whether there are outliers; in this step, we compare the residuals of Model 2 and Model 3.
r <- residuals(model3)
par(mfrow = c(1, 2))  # corrected: par(mfrow = c()) left the layout unset
hist(r, 30, main = "Residuals distribution", col = "red")
plot(r, main = "Fitted Residuals", col = "red")
abline(h = 0)
library(ggplot2)
library(moments)
x <- r
par(mfrow=c(1,2))
hist(r, main = "Histogram of Residuals", col = "red", freq = FALSE,ylab="Frequency")
curve(dnorm(x,0,sd(x)),add=TRUE, col="green", lwd = 2)
qqnorm(r, main = "Q-Q Plot of Residuals", col = "red", pch = 16)
qqline(r, col = "green", lty = 2, lwd = 2)
r_2 <- residuals(model2)
x <- r_2
par(mfrow=c(1,2))
hist(r_2, main = "Histogram of Residuals", col = "red", freq = FALSE,ylab="Frequency")
curve(dnorm(x,0,sd(x)),add=TRUE, col="green", lwd = 2)
qqnorm(r_2, main = "Q-Q Plot of Residuals", col = "red", pch = 16)
qqline(r_2, col = "green", lty = 2, lwd = 2)
Comparing the residuals of Model 2 and Model 3, both distributions are approximately normal and most points lie on the Q-Q line. Model 3's residuals adhere to the Q-Q line slightly more closely, which supports it as the better-fitting model.
Next, we test for autocorrelation of the residuals using the Durbin-Watson test.
library(car)
dwt(model3)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.116894 2.209375 0.634
## Alternative hypothesis: rho != 0
In summary, the Durbin-Watson test result suggests that there is no significant autocorrelation in the residuals of the regression model model3, as the D-W statistic is close to 2, and the p-value is much higher than 0.05. Therefore, the null hypothesis of no autocorrelation cannot be rejected.
# detect residual pattern
residual_pop <- ggplot(model_df, aes(x = pop, y = r)) +
  geom_point() +
  labs(title = "Residuals versus population", x = "population", y = "Residuals")
residual_gdp <- ggplot(model_df, aes(x = gdp_pps, y = r)) +
  geom_point() +
  labs(title = "Residuals versus GDP", x = "GDP", y = "Residuals")
grid.arrange(residual_pop, residual_gdp, ncol = 2)
The last thing we will do is use our model to predict the total new cases of two (2) countries not included in our model fit. At minimum, your work should include:
# The code below defines our 'newdata' data frame for applying our model to the
# population, population density and GDP per capita for two (2) countries.
# Please execute the code as given.
newdata <- data.frame(country = c("Luxembourg", "Netherlands"),
                      pop = c(626108, 17407585),
                      gdp_pps = c(261, 130),
                      pop_dens = c(626108, 17407585) / c(2586, 41540))
# Add code here returning the actual total cases from our dataset for the
# Netherlands and Luxembourg.
actual_data <- data %>%
  filter(countriesAndTerritories %in% c("Netherlands", "Luxembourg")) %>%
  group_by(countriesAndTerritories) %>%
  dplyr::summarize(total_cases = sum(cases, na.rm = TRUE))
print(actual_data)
## # A tibble: 2 × 2
## countriesAndTerritories total_cases
## <fct> <int>
## 1 Luxembourg 200079
## 2 Netherlands 8399426
# Add code here returning the total cases for the Netherlands and Luxembourg
# predicted by our model.
# Predicting total cases using the model
predicted_cases <- predict(model3, newdata)
# Adding predictions to the newdata data frame
newdata$predicted_cases <- predicted_cases
# Viewing the results
print(newdata)
## country pop gdp_pps pop_dens predicted_cases
## 1 Luxembourg 626108 261 242.1145 2769977
## 2 Netherlands 17407585 130 419.0560 7082063
Since the distribution of our dependent variable, total_cases, is skewed to the right, we want to see whether making it more normally distributed also makes the residuals more normally distributed, thus improving the model's fit.
The log transformation is, arguably, the most popular among the different types of transformations used to transform skewed data to approximately conform to normality. As we can see from the histogram before and after log transformation, the data is significantly less skewed after transformation.
log_model = model_df
log_model = log_model[c(1:2,4:6)]
par(mfrow = c(1,2))
hist(log_model$total_cases, main = "Total Cases Before Log Transformation", xlab = "Total Cases", col= "lightblue")
hist(log(log_model$total_cases), main = "Total Cases After Log Transformation", xlab = "Log Total Cases", col = "lightblue")
Correlation table after defining 'log_model': note that the correlations are identical to those computed earlier, because the log transformation was used only for the histogram above and has not been applied to the columns of 'log_model'.
correl <- cor(log_model[2:5])
correl %>%
kbl() %>%
kable_styling()
| pop | gdp_pps | pop_dens | total_cases | |
|---|---|---|---|---|
| pop | 1.0000000 | 0.0235758 | -0.0813254 | 0.9428576 |
| gdp_pps | 0.0235758 | 1.0000000 | 0.0205849 | 0.0918025 |
| pop_dens | -0.0813254 | 0.0205849 | 1.0000000 | -0.0493727 |
| total_cases | 0.9428576 | 0.0918025 | -0.0493727 | 1.0000000 |
pairs(log_model[c(2:5)], col = "blue", main = "Correlation Matrix")
Step AIC:
model_log = lm(total_cases~pop+gdp_pps+pop_dens, log_model)
stepAIC(model_log)
## Start: AIC=609.13
## total_cases ~ pop + gdp_pps + pop_dens
##
## Df Sum of Sq RSS AIC
## - pop_dens 1 1.4304e+12 2.2760e+14 607.26
## - gdp_pps 1 1.0204e+13 2.3638e+14 608.01
## <none> 2.2617e+14 609.13
## - pop 1 1.8938e+15 2.1199e+15 651.89
##
## Step: AIC=607.26
## total_cases ~ pop + gdp_pps
##
## Df Sum of Sq RSS AIC
## - gdp_pps 1 1.0382e+13 2.3799e+14 606.15
## <none> 2.2760e+14 607.26
## - pop 1 1.8980e+15 2.1256e+15 649.94
##
## Step: AIC=606.15
## total_cases ~ pop
##
## Df Sum of Sq RSS AIC
## <none> 2.3799e+14 606.15
## - pop 1 1.9057e+15 2.1436e+15 648.11
##
## Call:
## lm(formula = total_cases ~ pop, data = log_model)
##
## Coefficients:
## (Intercept) pop
## -1.089e+06 4.313e-01
The stepAIC trace again selects population alone as the final model. We nonetheless fit 'model_log1' with population and population density; note that, despite its name, it still regresses the untransformed total_cases, and its adjusted R² and p-values are not as good as Model 3's.
model_log1 = lm(total_cases~pop+pop_dens, log_model)
summary(model_log1)
##
## Call:
## lm(formula = total_cases ~ pop + pop_dens, data = log_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9099198 -566886 273430 1310045 8990131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.256e+06 1.159e+06 -1.084 0.293
## pop 4.324e-01 3.697e-02 11.696 1.49e-09 ***
## pop_dens 8.321e+02 2.446e+03 0.340 0.738
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3729000 on 17 degrees of freedom
## Multiple R-squared: 0.8897, Adjusted R-squared: 0.8768
## F-statistic: 68.58 on 2 and 17 DF, p-value: 7.259e-09
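Since 'model_log1' regresses the untransformed response, a sketch of actually fitting the log-transformed response the narrative describes may be useful. On the exam data this could be, for example, `lm(log(total_cases) ~ log(pop), data = model_df)` (also logging the skewed population predictor), with predictions back-transformed via `exp()`. The `demo` frame below is synthetic, not exam data:

```r
# Hedged sketch on synthetic data: fit log(total_cases) as the response, then
# back-transform predictions with exp() to return to the case scale.
set.seed(3)
demo <- data.frame(pop = round(10^runif(20, 5.5, 8)))
demo$total_cases <- round(0.4 * demo$pop * exp(rnorm(20, 0, 0.3)))
log_fit <- lm(log(total_cases) ~ log(pop), data = demo)
coef(log_fit)  # slope near 1 reflects roughly proportional scaling
exp(predict(log_fit, newdata = data.frame(pop = 17407585)))  # case-scale prediction
```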
We want to check the normality of the residuals. The Shapiro-Wilk tests show that Model 3 barely passes the normality check (p = 0.058 > 0.05), while the residuals of 'model_log1' fail it (p = 0.011), so this alternative did not improve residual normality.
shapiro.test(model_log1$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_log1$residuals
## W = 0.86763, p-value = 0.01068
shapiro.test(model3$residuals)
##
## Shapiro-Wilk normality test
##
## data: model3$residuals
## W = 0.90771, p-value = 0.05766
A visual comparison of the residual distributions of Model 3 and 'model_log1':
par(mfrow = c(2,2))
hist(model_log1$residuals, main = "Log Model Residuals", col="lightblue")
qqnorm(model_log1$residuals)
hist(model3$residuals, main = "Model 3 Residuals", col = "lightpink")
qqnorm(model3$residuals)
From the histograms and Q-Q plots, the residuals of 'model_log1' do not appear convincingly more normal than Model 3's, consistent with the Shapiro-Wilk results, and 'model_log1' also trails on the other measures, such as adjusted R² and predictive accuracy. Since Model 3 passes the normality check and performs better in those areas, Model 3 remains our best-performing model.